It is possible to build a fully automated, publicly available tool for exploring research trends.
The interactive nature of Streamlit allows for much deeper and more personalized exploration than a static chart.
Future Work
Performance: Implement a pre-computed inverted index for near-instant search results.
New Features:
Author-level analysis.
Shareable URLs to save and send specific views.
Advanced Analysis:
Use word embeddings for semantic search.
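To make the "pre-computed inverted index" idea concrete, here is a minimal sketch (not part of the current codebase, just an illustration): we map each token to the set of paper IDs whose title contains it, so a search becomes a dictionary lookup instead of a full scan.

```python
from collections import defaultdict

def build_inverted_index(titles_by_id):
    """titles_by_id: dict mapping paper ID -> title string."""
    index = defaultdict(set)
    for paper_id, title in titles_by_id.items():
        for token in title.lower().split():
            index[token].add(paper_id)
    return index

def search(index, query):
    """Return IDs of papers whose titles contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Intersect the ID sets for each token.
    results = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results
```

A real implementation would also handle stemming and punctuation, but the lookup-then-intersect pattern is the core of the speedup.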
Code Explained!
The Goal: 01_ingest_data.py
This script is the first and most critical step in our automated data pipeline.
Its primary job is to:
Download the latest paper metadata from the NBER website.
Compare it against our local SQLite database.
Append only the new papers, ensuring our data is always up-to-date without duplication.
Configuration and Setup
A well-structured script starts with clear configuration. We define constants for our data source and local database paths. Using pathlib makes path manipulation clean and OS-agnostic.
```python
import pandas as pd
import sqlite3
from pathlib import Path
import csv

# --- Configuration ---
# Define the base URL for NBER data
BASE_URL = "http://data.nber.org/nber_paper_chapter_metadata/tsv/"

# Define project structure paths
PROJECT_ROOT = Path(__file__).resolve().parents[1]
DB_PATH = PROJECT_ROOT / "data" / "03_primary" / "nber_papers.db"
TABLE_NAME = "papers"
```
Downloading the Raw Data
Next, we fetch the data directly from the NBER’s TSV (Tab-Separated Values) files using pandas.
A try...except block makes our script resilient to network errors or changes in the source URL.
```python
def ingest_and_update_db():
    """..."""
    try:
        print("Downloading core reference data (ref.tsv)...")
        latest_df = pd.read_csv(
            f"{BASE_URL}ref.tsv",
            sep='\t',
            names=['paper', 'author', 'title', 'issue_date', 'doi'],
            header=0,
            engine='python',
            quoting=csv.QUOTE_NONE
        )
        print(f"Successfully loaded {latest_df.shape[0]} total records from ref.tsv.")
```
Note: Setting quoting=csv.QUOTE_NONE is important because the source TSV files contain unquoted special characters that can break the default parser.
Merging with Abstracts
The metadata and abstracts are in separate files. We download the abstracts and use pd.merge to join them into a single, unified DataFrame. A left join ensures we keep all papers, even if an abstract is missing.
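The merge step looks roughly like the sketch below. The abstracts DataFrame here is built inline for illustration; in the real script it would be downloaded like ref.tsv, and the abstract file's name and column layout are assumptions, not confirmed details.

```python
import pandas as pd

# Stand-in for the downloaded abstracts file (hypothetical columns).
abstracts_df = pd.DataFrame({
    'paper': ['w100', 'w200'],
    'abstract': ['Abstract text for w100.', 'Abstract text for w200.']
})

# Stand-in for the metadata loaded from ref.tsv.
latest_df = pd.DataFrame({
    'paper': ['w100', 'w200', 'w300'],
    'title': ['Title A', 'Title B', 'Title C']
})

# A left join keeps every paper; w300 gets NaN for its missing abstract.
merged_df = pd.merge(latest_df, abstracts_df, on='paper', how='left')
```

The `how='left'` argument is what guarantees no paper is dropped just because its abstract has not been published yet.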
Real-world data is often messy. The issue_date column can contain invalid entries like ‘0000-00-00’.
We handle this gracefully:
Use pd.to_datetime with errors='coerce' to turn any invalid date formats into NaT (Not a Time).
Use .dropna() to remove these rows, ensuring high data quality downstream.
```python
        # --- MORE Data Cleaning ---
        print("Cleaning 'issue_date' column...")

        # Convert to datetime; coercing errors turns invalid dates like '0000-00-00' into NaT
        latest_df['issue_date'] = pd.to_datetime(latest_df['issue_date'], errors='coerce')

        # Drop rows where the date conversion resulted in NaT (Not a Time)
        latest_df = latest_df.dropna(subset=['issue_date'])
```
Checking the Existing Database
Before we add new data, we need to know what we already have. We connect to the SQLite database and read all existing paper IDs into a Python set for highly efficient lookups.
```python
        # 3. Connect to SQLite DB and get existing paper IDs
        existing_ids = set()
        try:
            conn = sqlite3.connect(DB_PATH)
            # Check SQLite's catalog to see whether the table already exists.
            table_exists = conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
                (TABLE_NAME,)
            ).fetchone()
            if table_exists:
                print(f"Database found at {DB_PATH}. Fetching existing paper IDs...")
                existing_ids_df = pd.read_sql(f"SELECT paper FROM {TABLE_NAME}", conn)
                existing_ids = set(existing_ids_df['paper'])
                print(f"Found {len(existing_ids)} existing papers in the database.")
            else:
                print("No existing database found. A new one will be created.")
            conn.close()
        except Exception as e:
            # ...
```
Identifying New Papers
This is the key step for preventing duplicates. Pandas boolean indexing, combined with the isin() method, filters the downloaded DataFrame in a single expression, keeping only the rows whose 'paper' ID is not already in our existing_ids set.
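The filtering expression described above looks like this (shown with small inline data so it stands alone; in the script, latest_df and existing_ids come from the earlier steps):

```python
import pandas as pd

# Stand-ins for the downloaded data and the IDs already in the database.
latest_df = pd.DataFrame({'paper': ['w100', 'w200', 'w300']})
existing_ids = {'w100', 'w200'}

# ~ negates the boolean mask: keep rows whose ID is NOT in existing_ids.
new_papers_df = latest_df[~latest_df['paper'].isin(existing_ids)]
```

Because existing_ids is a set, each isin() membership check is effectively constant time, so this scales well even with tens of thousands of papers.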
If any new papers were found, we connect to the database again and append them. Using to_sql with if_exists='append' is crucial: it tells pandas to add the new rows to the existing table (creating it on the first run) instead of replacing the table or raising an error.
```python
    if not new_papers_df.empty:
        print(f"Found {len(new_papers_df)} new papers to add.")
        try:
            conn = sqlite3.connect(DB_PATH)
            new_papers_df.to_sql(
                TABLE_NAME,
                conn,
                if_exists='append',
                index=False
            )
            conn.close()
            print("Successfully appended new papers to the database.")
        except Exception as e:
            print(f"Error appending new papers: {e}")
```
Executing the Script
Finally, the if __name__ == '__main__': block ensures the ingest_and_update_db function runs only when the script is executed directly. This is standard Python practice that allows functions from this script to be safely imported into other files without side effects.
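The guard is only two lines. A self-contained illustration (with a placeholder body standing in for the full pipeline):

```python
def ingest_and_update_db():
    # Placeholder for the download/clean/append pipeline described above.
    print("Ingest pipeline would run here.")

# Runs only when the script is executed directly, not when it is imported.
if __name__ == '__main__':
    ingest_and_update_db()
```

When another module does `import ingest_data` to reuse a helper, Python sets `__name__` to the module's name rather than `'__main__'`, so the pipeline does not fire as a side effect of the import.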